
Introduction: Home Credit Project by Group 3 - DSEB 62

Overview of the project

Dataset description

(figure: overview diagram of the Home Credit dataset)

Several notes to effectively run our Jupyter notebook

If the pie chart in Data Analysis section is blank

We use Plotly to visualize the pie chart effectively. If you cannot see the plot, i.e., the output section is totally blank, you may have a problem with your current version of Plotly, or Plotly is not installed. The solution is quite simple:

In the Anaconda Prompt:

Wait until this process ends, then reinstall it.
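The exact commands were not preserved here; a typical uninstall-then-reinstall sequence (assuming a pip-based environment) looks like this:

```shell
# Remove the existing (possibly broken) plotly install
pip uninstall -y plotly
# Wait until the uninstall finishes, then install a fresh copy
pip install plotly
```

If you manage packages through conda instead, the equivalent is `conda remove plotly` followed by `conda install plotly`.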

We hope these instructions work well. Otherwise, please contact us!

To have a clear table of contents of our notebook

We use the "jupyter-navbar" extension to get a navigation bar (table of contents) like the one in the picture below. Please follow these super-easy instructions to get it:

(screenshot: the jupyter-navbar table of contents)

Imports

We are using a typical data science stack: numpy, pandas, matplotlib, seaborn, plotly.
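A minimal import cell matching this stack might look like the following sketch; the visualization extras are guarded so the notebook still runs in an environment where one of them is missing (see the Plotly note above):

```python
import numpy as np                # numerical arrays
import pandas as pd               # dataframes
import matplotlib.pyplot as plt   # base plotting

# Optional visualization libraries: guard the imports so the core
# stack still loads when one of them is not installed.
try:
    import seaborn as sns
except ImportError:
    sns = None
try:
    import plotly.express as px
except ImportError:
    px = None
```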

Load the dataset

There are a total of 10 files: 1 main training file (with the target), 1 main testing file (without the target), 1 example submission file (not needed for our tasks), 1 file describing the columns of each csv file, and 6 other files containing additional information about each loan.

EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us. It generally starts out with a high level overview, then narrows in to specific areas as we find intriguing areas of the data. The findings may be interesting in their own right, or they can help us decide which features to use.

Application_train table

This table is static data for all applications. One row represents one loan in our data sample.

Examine the Distribution of the Target Column

The target is what we are asked to predict: 0 if the loan was repaid on time, or 1 if the client had payment difficulties. We can first examine the number of loans falling into each category.

From this information, we see this is an imbalanced class problem. There are far more loans that were repaid on time than loans that were not repaid.

Examine Missing Values

Next we can look at the number and percentage of missing values in each column.
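The helper used for this (called `missing_values_table` later in the notebook) can be sketched as follows; the exact implementation in our notebook may differ slightly:

```python
import pandas as pd

def missing_values_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column,
    keeping only columns that have at least one missing value."""
    miss = df.isnull().sum()
    pct = 100 * miss / len(df)
    table = pd.DataFrame({"Missing Values": miss, "% of Total Values": pct})
    table = table[table["Missing Values"] > 0]
    return table.sort_values("% of Total Values", ascending=False)
```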

Examine Unique Values

Next we can look at the number of unique values, and the values themselves, in each column.

Describe the statistics of all columns

Describe some basic statistics such as frequency, mean, std, IQR, min, max,...

The histogram of each column

A frequency distribution shows how often each different value in a set of data occurs. A histogram is the most commonly used graph to show frequency.

We make use of the histogram for each column in each dataframe provided by Kaggle: https://www.kaggle.com/competitions/home-credit-default-risk/data

Examine the outliers for each column

Using boxplot, we can easily see the outliers of each column in the dataset

Correlation of columns

The correlation coefficient is a statistical measure of the strength of a linear relationship between two variables. Its values can range from -1 to 1. Some general interpretations of the absolute value of the correlation coefficient are:

.00-.19 “very weak”

.20-.39 “weak”

.40-.59 “moderate”

.60-.79 “strong”

.80-1.0 “very strong”

We will filter the pairs of variables that have a very strong relationship, i.e. correlation > 0.9.

Because these variables are very highly correlated with one another, we choose to drop one variable from each pair. We will therefore drop the following columns in the cleaning part:

"AMT_GOODS_PRICE", "FLAG_EMP_PHONE", "REGION_RATING_CLIENT", "APARTMENTS_MODE", "LIVINGAPARTMENTS_MODE", "APARTMENTS_MEDI", "LIVINGAPARTMENTS_MEDI", "LIVINGAREA_MODE", "LIVINGAREA_MEDI", "BASEMENTAREA_MODE", "BASEMENTAREA_MEDI", "YEARS_BEGINEXPLUATATION_MODE", "YEARS_BEGINEXPLUATATION_MEDI", "YEARS_BUILD_MODE", "YEARS_BUILD_MEDI", "COMMONAREA_MODE", "COMMONAREA_MEDI", "ELEVATORS_MODE", "ELEVATORS_MEDI", "ENTRANCES_MODE", "ENTRANCES_MEDI", "FLOORSMAX_MODE", "FLOORSMAX_MEDI", "FLOORSMIN_MODE", "FLOORSMIN_MEDI", "LANDAREA_MODE", "LANDAREA_MEDI", "TOTALAREA_MODE", "NONLIVINGAPARTMENTS_MODE", "NONLIVINGAPARTMENTS_MEDI", "NONLIVINGAREA_MODE", "NONLIVINGAREA_MEDI", "OBS_30_CNT_SOCIAL_CIRCLE".
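The pair search described above can be sketched with the upper triangle of the correlation matrix (a generic sketch, not our exact notebook code):

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.select_dtypes("number").corr().abs()
    # keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # NaN > threshold is False, so the masked lower triangle is skipped
    return [(a, b) for b in upper.columns for a in upper.index
            if upper.loc[a, b] > threshold]
```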

Columns filter

Some columns are dominated by a single value, which accounts for more than 95% of the rows.

Because these variables have very low variance (the data falls mostly on one value), we will drop the following columns in the cleaning part:

'FLAG_MOBIL', 'FLAG_CONT_MOBILE', 'REG_REGION_NOT_LIVE_REGION', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_7'
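The dominant-value filter described above can be sketched like this (a hedged sketch of the idea, not our exact notebook code):

```python
import pandas as pd

def dominated_columns(df: pd.DataFrame, threshold: float = 0.95):
    """Return the columns whose most frequent value covers more than
    `threshold` of the (non-null) rows."""
    dominated = []
    for col in df.columns:
        counts = df[col].value_counts(normalize=True, dropna=True)
        if len(counts) > 0 and counts.iloc[0] > threshold:
            dominated.append(col)
    return dominated
```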

Because DAYS_EMPLOYED has a lot of outliers, we will drop this column in a later part.

Bureau table

The 'bureau.csv' represents all clients' previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample). For every loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date.

Overview of the dataframe

Examine Missing Values

Next we can look at the number and percentage of missing values in each column.

Describe the statistics of all columns

Describe some basic statistics such as frequency, mean, std, IQR, min, max,...

Correlation of columns

As before, we interpret the absolute value of the correlation coefficient on the scale from "very weak" (.00-.19) up to "very strong" (.80-1.0).

The highest correlation is 0.88, which represents the strong linear relationship between DAYS_CREDIT and DAYS_ENDDATE_FACT.

Bureau_balance table

The 'bureau_balance.csv' represents the monthly balance of previous credits in the Credit Bureau. This table has one row for each month of history of every previous credit reported to the Credit Bureau - i.e. the table has (# loans in sample × # of relative previous credits × # of months where we have some history observable for the previous credits) rows.

Overview of the dataframe

Examine Missing Values

Next we can look at the number and percentage of missing values in each column.

The histogram of each column

A frequency distribution shows how often each different value in a set of data occurs. A histogram is the most commonly used graph to show frequency.

Examine the outliers for each column

Using boxplot, we can easily see the outliers of each column in the dataset

We can clearly see that there are no outliers in the bureau_balance dataframe.

Credit_card_balance table

Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.

This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (# loans in sample × # of relative previous credit cards × # of months where we have some history observable for the previous credit card) rows.

A first look at the dataframe

Examine Missing Values

Next we can look at the number and percentage of missing values in each column.

Examine Unique Values of Contract status

Next we can look at the number of unique values, and the values themselves, in the categorical column Contract status.

Describe the statistics of all columns

Describe some basic statistics such as frequency, mean, std, IQR, min, max,...

Correlation of columns

As before, we interpret the absolute value of the correlation coefficient on the scale from "very weak" (.00-.19) up to "very strong" (.80-1.0).

These couple have correlation > 0.9:

[('AMT_PAYMENT_CURRENT', 'AMT_PAYMENT_TOTAL_CURRENT'), ('AMT_BALANCE', 'AMT_RECEIVABLE_PRINCIPAL'), ('AMT_BALANCE', 'AMT_RECIVABLE'), ('AMT_RECEIVABLE_PRINCIPAL', 'AMT_RECIVABLE'), ('AMT_BALANCE', 'AMT_TOTAL_RECEIVABLE'), ('AMT_RECEIVABLE_PRINCIPAL', 'AMT_TOTAL_RECEIVABLE'), ('AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE'), ('CNT_DRAWINGS_CURRENT', 'CNT_DRAWINGS_POS_CURRENT')]

Columns filter

Some columns can be represented by other columns, so in a later part we will drop these columns.

Installments_payments dataframe

A First Look at the Dataset

Examine Missing Values

Next we can look at the number and percentage of missing values in each column.

Missing values statistics

```python
missing_values_installments_payments = missing_values_table(installments_payments)
missing_values_installments_payments
```

Two columns, the day and the amount the customer actually paid (DAYS_ENTRY_PAYMENT and AMT_PAYMENT), contain the same number of null values.

A bit more than 0.02% of clients haven't paid their previous credit on this installment.

Describe the statistics of all columns

Describe some basic statistics such as frequency, mean, std, IQR, min, max,...

The average AMT_INSTALMENT (the prescribed installment amount on this installment) being smaller than the average AMT_PAYMENT (the amount the client actually paid on this installment) suggests:

Some loans are not repaid on time.

The histogram of DAYS_INSTALMENT and DAYS_ENTRY_PAYMENT

A frequency distribution shows how often each different value in a set of data occurs. A histogram is the most commonly used graph to show frequency.

The histograms of the prescribed installment day and the actual payment day also suggest late payments by customers.

Correlation of columns

As before, we interpret the absolute value of the correlation coefficient on the scale from "very weak" (.00-.19) up to "very strong" (.80-1.0).

Summarization of this table

POS_CASH_balance dataframe

Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit. This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (# loans in sample × # of relative previous credits × # of months in which we have some history observable for the previous credits) rows.

Examine Missing Values

Next we can look at the number and percentage of missing values in each column.

Examine Unique Values

Next we can look at the number of unique values, and the values themselves, in each column.

We focus on the categorical column: Contract status

In this dataset, there is only one categorical feature, which indicates the contract status of the previous application. We can see that most previous applications are still active.

Examine the outliers for each column

Using boxplot, we can easily see the outliers of each column in the dataset

We can see that SK_DPD, SK_DPD_DEF, CNT_INSTALMENT and CNT_INSTALMENT_FUTURE have outlier values.

Correlation between variables

Here, we can see that CNT_INSTALMENT and CNT_INSTALMENT_FUTURE are highly correlated (0.87).

Columns filter

Some columns are dominated by zeros, which take up more than 97% of the values.

SK_DPD and SK_DPD_DEF have a lot of 0 values (nearly 100%), so we will delete these features later.

previous_application table

All previous applications for Home Credit loans of clients who have loans in our sample. There is one row for each previous application related to loans in our data sample.

Examine Missing Values

Next we can look at the number and percentage of missing values in each column.

Examine categorical variables

We can get useful insights from the plots above:

Examine the outliers of 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE'

These columns have too many outliers.

Correlation findings

Here, we can see several interesting things:

We will delete one feature from each pair of features that have an absolute correlation above 0.9.

Columns filter

Columns with high percentage of missing values:

For columns that are highly correlated with each other, we will consider removing one variable from each pair.

Between:

In conclusion, in the next part we will remove these columns for correlation reasons:

Other reasons

Therefore, in the next part, we drop these features:

COLUMNS CLEANING

We drop some columns for the reasons stated in the previous part. We do this column-cleaning step first because it will save us a lot of heavy work after combining the datasets!

Application_train

Application_test

Bureau, bureau_balance and installments_payments

In these tables, we do not have to drop any columns

Credit_card_balance

POS_CASH_balance

Previous_application

UNITE ALL DATASETS

This is kinda tricky, as the data comes in different formats and is represented differently.

Our head dataset will be called 'data', which is just a merge of train and test.

Other pieces we will add after some processing

Combine "train" & "test" into "data"

data = train + test

This one is easy: both datasets have exactly the same format, the only difference being that the TARGET column is present only in the train set.
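The `data = train + test` step is one `pd.concat` call; here is a sketch with toy frames (the real ones come from the Kaggle csv files):

```python
import pandas as pd

# toy stand-ins for application_train / application_test
train = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [100.0, 200.0],
                      "TARGET": [0, 1]})
test = pd.DataFrame({"SK_ID_CURR": [3], "AMT_CREDIT": [150.0]})

# TARGET exists only in train, so it becomes NaN for the test rows
data = pd.concat([train, test], ignore_index=True, sort=False)
```

The NaN TARGET values are exactly what lets us split `data` back into train and test later.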

Combine 'bureau balance' into 'bureau' into 'data'

bureau_balance -> bureau -> data

Before we merge data with bureau, we need to merge the bureau dataframe with the related information in the bureau_balance file.

What is the exact problem here:

  1. The bureau dataframe comes from the Credit Bureau authority and contains one row for each credit that a client from the train/test dataset has taken previously. It is matched with train/test by SK_ID_CURR; while in train/test SK_ID_CURR does not duplicate (one row per client whom we are trying to classify), in most cases the bureau dataframe has multiple rows for the same client, as he/she had applied for multiple loans previously.

  2. In turn, bureau_balance extends the previous credit information even further. It contains a separate row for each month of history of every previous credit reported to the Credit Bureau (the bureau dataframe) and is related to the bureau df via SK_ID_BUREAU.

So the approach we are going to use is to calculate mean of each statistical column out of these both dataframes to include these mean values as features of our clients whom we are trying to classify. For example: mean days overdue for all credits that the client had previously taken.

I have to say that this approach leaves out some information such as categorical columns in some cases. For example the client with SK_ID_CURR = 666 had 7 credits in bureau dataframe, and when we collapse all these credits (grouped by one ID) into one line to indicate mean values for these credits, we will not be able to show a CREDIT_ACTIVE column that has different categorical values as Closed or Active for different previous credits. So this leaves room for some interesting feature engineering here.

Steps that we need to take:

  1. Collapse bureau_balance dataframe to mean values grouped by SK_ID_BUREAU
  2. Merge this with bureau dataframe
  3. Collapse bureau dataframe to mean values grouped by SK_ID_CURR
  4. Merge what we've got with our data df
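The four steps above can be sketched as follows; the toy frames and the `collapse_mean` helper are assumptions for illustration, not our notebook's exact code:

```python
import pandas as pd

def collapse_mean(df, key, prefix):
    """Group by `key`, take the mean of every numeric column, and prefix
    the resulting feature names so their origin stays visible."""
    agg = df.groupby(key).mean(numeric_only=True)
    return agg.add_prefix(prefix).reset_index()

# toy stand-ins for the real tables (column names follow the Kaggle schema)
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "SK_ID_BUREAU": [10, 11, 12],
                       "DAYS_CREDIT": [-100, -200, -50]})
bureau_balance = pd.DataFrame({"SK_ID_BUREAU": [10, 10, 11],
                               "MONTHS_BALANCE": [-1, -2, -1]})

# STEP 1: collapse bureau_balance to means per SK_ID_BUREAU
bb = collapse_mean(bureau_balance, "SK_ID_BUREAU", "BUR_BAL_MEAN_")
# STEP 2: merge into bureau (left join keeps credits without balance history)
bureau = bureau.merge(bb, on="SK_ID_BUREAU", how="left")
# STEP 3: collapse bureau to means per client
bur = collapse_mean(bureau.drop(columns="SK_ID_BUREAU"),
                    "SK_ID_CURR", "PREV_BUR_MEAN_")
# STEP 4: `bur` is then left-merged into `data` on SK_ID_CURR
```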

Even though we've decided not to perform any feature engineering, one useful feature here is just asking for it. Let's calculate the total number of previous credits taken by each client and include this in our statistics. I believe that kind of information would be quite useful. So let's quickly do that before executing our program defined above

Add a new column PREVIOUS_LOANS_COUNT of each client (SK_ID_CURR) by groupby the table 'bureau'
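The count feature can be computed with a groupby-count; toy data below, and the column name PREVIOUS_LOANS_COUNT is the one introduced in the text:

```python
import pandas as pd

# toy stand-in for the bureau table
bureau = pd.DataFrame({"SK_ID_CURR": [1, 1, 1, 2],
                       "SK_ID_BUREAU": [10, 11, 12, 13]})

previous_loans_count = (
    bureau.groupby("SK_ID_CURR")["SK_ID_BUREAU"]
    .count()
    .rename("PREVIOUS_LOANS_COUNT")
    .reset_index()
)
# this frame is then left-merged into `data` on SK_ID_CURR
```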

Now back to merging with all the bureau and bureau_balance information

STEP 1 - collapse bureau_balance

As you can see, this dataframe does not include the bureau_balance categorical column STATUS.

Also note that our aggregation has changed the name of SK_ID_BUREAU; we need to change it back in order to use it when merging with the bureau df.

One might argue that we didn't need to add .add_prefix(...) to our formula above, but when working with the larger datasets below it will prove useful.

STEP 2 - merge with bureau

STEP 3 - collapse bureau

Looks good. There are a few missing values, though, which we will deal with later.

STEP 4 - merge bureau with data

So here we've created 13 new features and added them to our train/test dataset called 'data'

The next process is: installments_payments, credit_card_balance, POS_CASH_balance => previous_application

(diagram: merging installments_payments, credit_card_balance and POS_CASH_balance into previous_application, then into data)

Quick information on this block of data: surprisingly... previous_application reflects clients' previous applications for loans to Home Credit. As before, previous_application unfolds into a load of statistics with three other dataframes:

So the plan here would be the following:

  1. Collapse credit_card_balance dataframe to mean values grouped by SK_ID_PREV
  2. Merge with previous_application (our 'leading' dataset in this case)
  3. Collapse POS_CASH_balance to mean values grouped by SK_ID_PREV
  4. Merge with previous_application
  5. Collapse installments_payments ...
  6. Merge with previous_application
  7. Collapse the resulting previous_application dataset to mean values grouped by SK_ID_CURR
  8. Merge our unfolded previous_application statistics with our data
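The plan above can be sketched as a loop over the three side tables, redefining a small mean-collapse helper so the sketch stays self-contained (toy frames and prefixes are assumptions for illustration):

```python
import pandas as pd

def collapse_mean(df, key, prefix):
    """Mean-aggregate numeric columns per `key`, prefixing feature names."""
    return df.groupby(key).mean(numeric_only=True).add_prefix(prefix).reset_index()

# toy stand-ins; the real frames come from the Kaggle csv files
previous_application = pd.DataFrame({"SK_ID_PREV": [100, 101],
                                     "SK_ID_CURR": [1, 1],
                                     "AMT_CREDIT": [1000.0, 3000.0]})
side_tables = {
    "CCB_MEAN_": pd.DataFrame({"SK_ID_PREV": [100, 100], "AMT_BALANCE": [50.0, 70.0]}),
    "INST_MEAN_": pd.DataFrame({"SK_ID_PREV": [101], "AMT_PAYMENT": [10.0]}),
    "POS_MEAN_": pd.DataFrame({"SK_ID_PREV": [100], "CNT_INSTALMENT": [12.0]}),
}

# steps 1-6: collapse each side table by SK_ID_PREV and merge it in
for prefix, table in side_tables.items():
    previous_application = previous_application.merge(
        collapse_mean(table, "SK_ID_PREV", prefix), on="SK_ID_PREV", how="left")

# steps 7-8: collapse to one row per client, then merge into `data`
prev_stats = collapse_mean(previous_application.drop(columns="SK_ID_PREV"),
                           "SK_ID_CURR", "PREV_APPL_MEAN_")
```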

But before we start, let's check whether there are any records in previous_application that are not in our data.

looks good

One more thing! We will delete SK_ID_CURR from credit_card_balance / POS_CASH_balance / installments_payments, as we do not need this column to be averaged: as a mean it carries no statistical meaning and would just clutter the space as noise. We will group these tables with our 'leading' dataset previous_application using SK_ID_PREV, and the 'leading' dataset keeps the SK_ID_CURR key for the further mapping with our data.

As previously, before tearing the previous applications to Home Credit apart into statistics, let's extract the number of each client's previous applications to Home Credit and add this feature to our data.

Now back to our process

Combine 'credit_card_balance' into 'previous_application'

STEP 1 - collapse credit_card_balance

STEP 2 - merge with previous_application

Combine 'installments_payments' into 'previous_application'

STEP 3 - collapse installments_payments

STEP 4 - merge with previous application

Combine 'POS_CASH_balance' into 'previous_application'

STEP 5 - collapse POS_CASH_balance

STEP 6 - merge with previous_application

Combine 'previous_application' into 'data'

STEP 7 - collapse the resulting previous_application dataset to show mean values grouped by SK_ID_CURR

STEP 8 - merge what we've got with our data

As we can see, this last sprint over previous applications added 32 new features to our statistics and completed the unification of all data

Summary so far:

The whole dataset is combined as follows:

First, we concatenate application_train and application_test to obtain the overall dataframe named: data

Next, the combining splits into two branches: the left branch and the right branch

Whichever branch we are in, the common idea is captured in one sentence: 'To attach table B to table A by key K (as suggested on the graph), key K must exist uniquely in table B.'

Left branch:

Right branch:

Note 1:

Note 2: The prefix of each column tells us which table the column came from:

PREPROCESS DATA

Split into train and test set

Perform split according to IDs in initial train and test datasets

The row counts are equal, so the split is correct, but let's double-check to be sure: 10001 and 10005 are in the test set, and now the train set has no null TARGET values.

Everything is looking good now!

Filling missing values

Fill missing values in train dataset

So, there is no column of 'category' type in the "train" dataset.
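One common filling strategy for this situation can be sketched as follows (a hedged sketch: median for numeric columns, mode for the rest; our notebook's exact choices may differ):

```python
import pandas as pd

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with their median and other columns with
    their most frequent value."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            mode = out[col].mode()
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
    return out
```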

Fill missing values in 'test' dataset

So, there is no column of 'category' type in the "test" dataset.

So, there are no more missing values in either the "train" or the "test" dataset.

Outlier Identification and Removal

We can see some insights:

There are a lot of columns which have a lot of duplicated values.
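Boxplot-style outlier identification is based on the interquartile range; a sketch using the usual 1.5×IQR fences (a generic sketch, not our exact notebook code):

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside the boxplot fences
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)
```

Rows flagged by the mask can then be inspected, capped, or dropped, depending on the column.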

Train dataset after preprocessing data

Test dataset after preprocessing data

FEATURE CONSTRUCTION

From features originally from application table

DAYS_LAST_PHONE_CHANGE

The values in DAYS_LAST_PHONE_CHANGE are negative because the data is recorded relative to the application date, counting back into the past. For example, if a person applies in 2022 and last changed their phone number in 2010, the stored value is (2010 - 2022) * 365 = -4380.

We will convert DAYS_LAST_PHONE_CHANGE to a number of years by dividing by -365 days. The new distribution is shown below.
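The conversion is one line in pandas (toy series for illustration):

```python
import pandas as pd

# negative day counts relative to the application date
days_last_phone_change = pd.Series([-4380.0, -365.0, 0.0])

# days in the past (negative) -> non-negative number of years
years_last_phone_change = days_last_phone_change / -365
```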

DELINQUENCIES

It is very important to see how many times clients were late with payments or defaulted on their loans. We suppose info about their social circle is also important. We'll divide the values into 3 groups: 0, 1 and more than 1.
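The 0 / 1 / more-than-1 bucketing can be sketched with `np.select` (toy counts below; in the notebook this is applied to the social-circle and days-past-due columns):

```python
import numpy as np
import pandas as pd

counts = pd.Series([0, 1, 3, 0, 7])

# map each count into one of three labelled groups
groups = pd.Series(
    np.select([counts == 0, counts == 1], ["0", "1"], default="1+"),
    index=counts.index,
)
```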

EXT_SOURCE

From features originally from Bureau (Balance)

The values in PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE, PREV_BUR_MEAN_DAYS_CREDIT and PREV_BUR_MEAN_DAYS_ENDDATE_FACT are negative because the data is recorded relative to the application date, counting back into the past. For example, if a person applies in 2022 and the credit expired in 2010, the stored value is (2010 - 2022) * 365 = -4380.

We will convert PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE, PREV_BUR_MEAN_DAYS_CREDIT and PREV_BUR_MEAN_DAYS_ENDDATE_FACT to a number of years by dividing by -365 days.

From features from previous_application group of tables

DATA ANALYSIS: INSIGHTS FINDING

From features originally or extracted from application table

Some new domain features

Most people have enough income to pay for their credit. Some even have an income 8 times larger than the amount of the loan.

The most frequent percentage of annuity per income is below 25%. That is, most clients have to spend about a quarter of their income to pay off the debt.

The credit term is from 0.02 to 0.12, mostly falling around 0.05.

Income per person is the client's income divided by the number of family members. The most frequent income per person is below 100000.

The adult members of the family are the ones who can earn money to pay for the loan. Most families have 1-2 adults.

Children are the dependents of the adults. The higher this ratio, the heavier the burden on the adults.

The credit amount per family member mostly falls around 250000.

Original features

Loan repayment

As we can see, the data is highly imbalanced. Most clients are able to repay the debt.

Gender of the client

About 66.8% of the clients are women.

Age of the client

The clients' ages range from 20 to 70. People aged 35-45 make up most of the borrowers. In general, in the first phase of working age, people's need to apply for a loan increases over time. In the period from 40 to 50, there is a slight decrease in the need for a loan. But after that, from over 50 until retirement, the average person's need for a home loan increases.

Types of loan

Most of the loans taken by applicants are cash loans: about 90%.

See whether the client has collateral

Only 32% of clients have (at least) a car, which can be good collateral.

Fortunately, nearly 70% of clients have collateral such as a house or an apartment.

In the assessment of a loan, we may consider the client's dependents, that is, their children and other family members.

Up to 70% of clients have no kids, and 20% of them have only one child.

More than half of the clients are in two-member families.

Basic information regarding job, income source and occupation

Most clients are labourers, sales staff or core staff, and are working at the time of application. But for many clients the occupation is "Unknown".

Background information of the client

Nearly 75% of the clients have the education level Secondary/secondary special.

Most clients own a house or an apartment.

Most clients are married, but when applying for the loan they are unaccompanied.

Amount of income, credit and annuity

The amounts of credit, annuity and income are nearly on the same level.

Region rating where the client lives

Most of the clients are from the 2nd-rating region.

Most of the clients live in areas which are not densely populated.

On what day and at what time does the client start the process

On weekends, people are less likely to start the process.

Clients mostly start the process at 10 AM; after that, the later it gets, the fewer processes are started.

Flag if the client's permanent address does not match the work address (1=different, 0=same, at region level). Most clients have a work address that matches their permanent address.

Most clients have a work address that matches their living address.

Matching of the address information

Most contact addresses and living addresses coincide. About 20% of clients have a work address that does not match their contact address or living address => ROUGHLY A QUARTER OF CLIENTS WORK IN A CITY DIFFERENT FROM WHERE THEY LIVE

External resources

There is a lack of data about the state of the customer (EMERGENCYSTATE_MODE, HOUSETYPE_MODE, FONDKAPREMONT_MODE), as the percentage of unknown values accounts for more than half.

Delinquencies

People with late payments (>1 day) seem to be much more numerous in the 60-day observation interval.

Did client provide the document

About 71% of clients provided document 3.

More than 90% of people provided documents 6 and 8.

Number of enquiries to Credit Bureau about the client before application

Usually more than 80% have no enquiries an hour, a day, or a week before the application. On the scale of a month or a year, there are more enquiries.

From features originally or extracted from bureau (balance)

How many times was the Credit Bureau credit prolonged

17% prolonged their credit account in the past.

Number of previous loans from client

Most people have at least one previous loan (mostly around 5 loans).

From features extracted from previous_application group of tables

When the installment of previous credit was supposed to be paid and actually paid

=> Both these features have the same distribution and it is skewed to the right

Amount of prescribed and actually paid installment amount of previous credit on this installment

=> Both of these features have the same distribution and it is skewed to the left

The average balance after paying the down payment and receiving the annuity of the previous application

The average balance after paying the down payment and receiving the annuity of the previous application is almost always between 0 and 20000 per person. We can see that many clients keep a balance after these activities of around 6000 to 7000.

The average amount of credit that the client hasn't paid during the month, in total, on the previous credit

The average amount of credit that the client hasn't paid during the month on the previous credit is mostly above 50000 (around 65000). The second-largest group consists of clients who can pay in full (around 0).

The difference between the average of required payment value and the amount that was actually paid.

(The amount that client hasn't paid)

The amount of the installment that the client hasn't paid is almost always around 0; we can see that the difference between the average required payment value and the amount that was actually paid is not very large, and it is quite balanced.

PREV_APPL_MEAN_POS_MEAN_PCTG_INSTALMENT_FUTURE_INSTALMENT:

(The percentage of the required installments value that wasn't paid.)

The percentage of the installment value that wasn't paid is mostly from 0.5 to 0.7 (50-70%). Most clients can pay only about half.

How many days early was the payment

The average number of days early that the installment payment was made is mostly positive, which shows that many clients pay before or on the due date -> these clients do not seem to have many problems with their economic conditions.

In contrast, a small portion is negative, which shows that a small number of clients pay late -> these clients may have problems with their economic conditions, or other factors cause them to delay.

Features that affect the TARGET variable

Target variable (0 - repaid loans, 1 - loans with payment difficulties)

We divide the columns into 4 groups for easier analysis. We mostly use kdeplot to find the differences between TARGET = 0 and TARGET = 1 for each feature of the train data.
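A stand-in sketch of this comparison plot: the notebook uses seaborn's kdeplot, but the same idea can be shown with a histogram-based density in plain matplotlib, so nothing beyond the basic stack is assumed:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; the notebook shows plots inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def density_by_target(df: pd.DataFrame, col: str, target: str = "TARGET"):
    """Overlay an approximate density of `col` for each TARGET class."""
    fig, ax = plt.subplots()
    for value, grp in df.groupby(target):
        ax.hist(grp[col].dropna(), bins=30, density=True,
                histtype="step", label=f"{target} == {value}")
    ax.set_xlabel(col)
    ax.set_ylabel("density")
    ax.legend()
    return ax
```

When the two curves clearly separate, the feature carries some signal about the target; overlapping curves suggest little effect.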

Group 1

Some numeric columns which originally come from the application_train dataframe

Almost none of the features have an effect on the target. Only "AGE" (or "DAYS_BIRTH") has a little effect on the TARGET.

At the age of 30-40, most people have difficulty paying the debt. Then, the older the client, the better they can pay off the debt. We will dig into it deeper.

There is a clear trend: younger applicants are more likely to not repay the loan! The rate of failure to repay is above 10% for the youngest three age groups and below 5% for the oldest age group.

This is information that could be directly used by the bank: because younger clients are less likely to repay the loan, maybe they should be provided with more guidance or financial planning tips. This does not mean the bank should discriminate against younger clients, but it would be smart to take precautionary measures to help younger clients pay on time.

Some categorical columns which originally come from the application_train dataframe

With the categorical features, we can draw one conclusion: for every category of each feature, most people are able to pay the debt.

Some columns which were extracted from other columns in the application_train dataframe

Actually, none of these features has an effect on the TARGET variable.

Group 2

The KDE plot above shows the probability density function of the continuous data variables and its correlation with TARGET.

We find that the features with the most impact are those related to external resources (EXT_SOURCE), judging by their distribution by TARGET and a correlation of ~0.22 (EXT_SOURCE_mean), and the address of the customer.

Though the correlation between the address features (REG_CITY_NOT_LIVE_CITY, REG_CITY_NOT_WORK_CITY, LIVE_CITY_NOT_WORK_CITY) and TARGET is just slightly higher than for other variables, we suppose it is an important consideration whether the client's addresses are consistent throughout the application.

Group 3

From the KDE, we can clearly see that the four columns PREV_BUR_MEAN_AMT_ANNUITY, PREV_BUR_MEAN_DAYS_CREDIT_UPDATE, PREV_BUR_MEAN_DAYS_CREDIT_ENDDATE and PREV_BUR_MEAN_DAYS_CREDIT slightly affect the target.

Group 4

Features which initially belonged to 'previous_application' dataset

We can see that 'PREV_APPL_MEAN_DAYS_DECISION' (the average number of days since the decision about the previous application was made) has a little effect on TARGET, because there are some clear differences between TARGET = 0 and TARGET = 1. The other features have no effect on TARGET.

Features which initially belonged to 'credit_card_balance' dataset

These features have the same distribution, and the frequency of 'TARGET = 0' is higher than that of 'TARGET = 1'. But they have no effect on TARGET.

Features which initially belonged to 'instalments_payments' dataset

We can see that both features:

Other features have no effect on TARGET.

Features which initially belonged to 'POS_CASH_balance' dataset

Most of these features have no effect on the TARGET variable. Only 'PREV_APPL_MEAN_POS_MEAN_MONTHS_BALANCE' (the average month of balance) has a little effect on TARGET, because there are some clear differences between TARGET = 0 and TARGET = 1.

Features which are new (from Features Construction)

In general, we can see that none of these features has an effect on the TARGET variable.